This is the codebase for our paper A New Notion of Individually Fair Clustering: $\alpha$-Equitable k-Center.

Overview

There are three main components to the codebase: the implementation of the clustering algorithms, the main file that executes these implementations, and the main file that generates plots from the results of those runs. Additionally, there is an auxiliary component that helps generate and preprocess the data for clustering and time the experiments.

Component	Description	Relevant files
Algorithm Implementation	Implementation of Algorithms 1, 2, and 3 from our paper as well as relevant modifications for experiments. Also includes implementation of the Hochbaum-Shmoys clustering algorithm as well as the Gonzalez algorithm as implemented by (Brubach et al. 2020)	equitable_clustering.py
Simulation	Uses the algorithm implementation to generate clusterings and analyze fairness	main.py
Plots	Generates plots from the output of the experiments	generate_plots.py
Utilities	Helper functions for timing code and for generating/preprocessing the data	timer.py, data/data_preparation.py
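
For orientation, the Gonzalez algorithm named in the table is the classical greedy farthest-first traversal for k-center. The following is a minimal standard-library sketch of that textbook procedure, not the code in equitable_clustering.py. The function name `gonzalez_k_center`, its signature, and the random choice of the first center are assumptions made for illustration.

```python
import math
import random


def gonzalez_k_center(points, k, seed=123):
    """Greedy farthest-first traversal (illustrative sketch, hypothetical API).

    points: list of coordinate tuples. Returns (center indices,
    per-point index into the center list, clustering radius).
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    centers = [rng.randrange(n)]  # arbitrary first center (assumption)
    # Distance from every point to its nearest chosen center so far.
    dist = [math.dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        # The point farthest from all current centers becomes the next center.
        nxt = max(range(n), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    # Assign each point to its closest center.
    assignment = [
        min(range(k), key=lambda j: math.dist(p, points[centers[j]]))
        for p in points
    ]
    return centers, assignment, max(dist)
```

On two well-separated groups of points with k = 2, the second center always lands in the group that does not contain the first, which is the behavior that gives the greedy procedure its 2-approximation guarantee.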

Dependencies

numpy==1.20.3
scipy==1.6.3
matplotlib==3.4.2
pandas==1.2.4
cycler==0.10.0

Data Preparation

We use the codebase of (Anderson et al. 2020) to prepare our data; all code in data is from their codebase. The sample files we used for our experiments can be found in data/processed (data/processed/adult/adult.pkl, data/processed/bank/bank.pkl, data/processed/creditcard/creditcard.pkl, data/processed/census1990/census1990.pkl, data/processed/diabetes/diabetes.pkl). Each contains at most 25000 points, the maximum number of points we use in our cluster computations, and was generated using data/data_preparation.py as provided in their codebase.
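
As a rough illustration of the 25000-point cap, the sketch below subsamples a list of points and reads a sample back from disk. It is not the logic of data/data_preparation.py: the helper names `cap_sample` and `load_sample`, the uniform sampling, and the assumption that a `.pkl` file is a plain Python pickle are all hypothetical.

```python
import pickle
import random

MAX_POINTS = 25000  # cap stated in this README


def cap_sample(points, max_points=MAX_POINTS, seed=123):
    """Return at most max_points points, drawn uniformly without replacement."""
    points = list(points)
    if len(points) <= max_points:
        return points  # already small enough, keep every point
    return random.Random(seed).sample(points, max_points)


def load_sample(path):
    """Read a pickled sample (assumes a plain pickle file; only open trusted files)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Fixing the seed keeps the subsample reproducible across runs, which matters when results are compared against stored outputs.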

Simulation

To run Alg-AG, Alg-PP, and Pseudo-PoF-Alg for a given value of k (k_val) and a given dataset (dataset_name):

python main.py --k k_val --sample_file data/processed/dataset_name/dataset_name_1.pkl

To allow quick modification of the algorithms without having to rerun the Hochbaum-Shmoys clustering algorithm, the Hochbaum-Shmoys algorithm's results are stored after running. These stored results are included in this repository so that the results presented in the paper can be reproduced exactly.
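
The store-and-reuse idea described above can be sketched as a small load-or-compute helper. This is an assumption about the general pattern only; the function `cached_clustering`, the pickle format, and the file name used in `demo` are hypothetical and do not describe how main.py stores its results.

```python
import os
import pickle
import tempfile


def cached_clustering(cache_path, compute):
    """Load a stored result if present; otherwise compute and store it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result


def demo():
    """Call the helper twice and report how often the expensive step ran."""
    calls = []

    def expensive():
        calls.append(1)  # stands in for a slow clustering run
        return {"centers": [0, 3]}

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "adult", "hs_k2.pkl")  # hypothetical file name
        first = cached_clustering(path, expensive)
        second = cached_clustering(path, expensive)
    return first, second, len(calls)
```

The second call in `demo` reads the stored result instead of recomputing, so the expensive step runs only once.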

Argument	Functionality	Values
sample_file	Name of the sample dataset to be used for the experiment (if generating new samples, note that the sample dataset must start with the dataset type so that the main file can figure out the default name to be used for the output_directory)	`data/processed/adult/adult_1.pkl` `data/processed/bank/bank_1.pkl` `data/processed/creditcard/creditcard_1.pkl` `data/processed/census1990/census1990_1.pkl` `data/processed/diabetes/diabetes_1.pkl`
k	The value of k to be used. Defaults to 2.	Integer
seed	The random seed to be used. Defaults to 123 (used in the paper's experiments).	Integer
output_directory	Specifies where to dump the assignments and analysis from the simulations. Defaults to `output`, so that experiments for a given dataset `<dataset>` are dumped to `output/<dataset>`, where `<dataset>` is one of "adult", "bank", "creditcard", "census1990", or "diabetes".	String
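
The argument table can be read as the following argparse sketch, which also shows why the sample file name must start with the dataset type. The defaults come from the table; everything else (the helper names, making `--sample_file` required, and the prefix-matching rule) is an assumption and may differ from main.py.

```python
import argparse
import os

DATASETS = ("adult", "bank", "creditcard", "census1990", "diabetes")


def dataset_from_sample_file(sample_file):
    """Infer the dataset type from the leading part of the file name."""
    name = os.path.basename(sample_file)
    for dataset in DATASETS:
        if name.startswith(dataset):
            return dataset
    raise ValueError(f"sample file name must start with one of {DATASETS}")


def build_parser():
    parser = argparse.ArgumentParser(description="Simulation arguments (sketch)")
    parser.add_argument("--sample_file", required=True)  # required is an assumption
    parser.add_argument("--k", type=int, default=2)
    parser.add_argument("--seed", type=int, default=123)
    parser.add_argument("--output_directory", default="output")
    return parser


def resolve_output_directory(args):
    """Return <output_directory>/<dataset> for the parsed arguments."""
    dataset = dataset_from_sample_file(args.sample_file)
    return os.path.join(args.output_directory, dataset)
```

For example, parsing `--sample_file data/processed/bank/bank_1.pkl` with no other flags gives k = 2, seed = 123, and an output location of `output` joined with `bank`.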

Plots

In order to generate plots for a given list of values of k, <list_of_k_vals> and a given list of datasets, <list_of_datasets>:

python generate_plots.py --k <list_of_k_vals> --dataset <list_of_datasets>

Argument	Functionality	Values
k	The values of k to be plotted. Defaults to [2, 4, 8, 16, 32, 64, 128] (used in the paper's experiments).	List[Integer]
dataset	The datasets for which plots should be generated. Defaults to ["adult", "bank", "creditcard", "census1990", "diabetes"] (used in the paper's experiments).	List[String]
output_directory	The output directory in which to search for the results of the experiments. Defaults to `output`, so that for a given `<dataset>` the results are searched for in `output/<dataset>`.	String
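
One plausible way to accept the list-valued arguments above is argparse with `nargs="+"`, so that values are passed separated by spaces. This is a sketch under that assumption; generate_plots.py may parse its lists differently, and the helper names `build_plot_parser` and `result_directories` are hypothetical.

```python
import argparse

DEFAULT_K = [2, 4, 8, 16, 32, 64, 128]
DEFAULT_DATASETS = ["adult", "bank", "creditcard", "census1990", "diabetes"]


def build_plot_parser():
    parser = argparse.ArgumentParser(description="Plot arguments (sketch)")
    # nargs="+" collects one or more space-separated values into a list.
    parser.add_argument("--k", type=int, nargs="+", default=DEFAULT_K)
    parser.add_argument("--dataset", nargs="+", default=DEFAULT_DATASETS)
    parser.add_argument("--output_directory", default="output")
    return parser


def result_directories(args):
    """List the per-dataset directories that would be searched for results."""
    return [f"{args.output_directory}/{dataset}" for dataset in args.dataset]
```

Under this assumption, a call such as `python generate_plots.py --k 2 8 --dataset adult bank` would plot two values of k for two datasets.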

chakrabarti/equitable_clustering
